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Abstract 

We study a generalized framework for structured sparsity. It extends the well- 
known methods of Lasso and Group Lasso by incorporating additional constraints 
on the variables as part of a convex optimization problem. This framework pro- 
vides a straightforward way of favouring prescribed sparsity patterns, such as 
orderings, contiguous regions and overlapping groups, among others. Existing 
optimization methods are limited to specific constraint sets and tend to not scale 
well with sample size and dimensionality. We propose a novel first order proximal 
method, which builds upon results on fixed points and successive approximations. 
The algorithm can be applied to a general class of conic and norm constraints sets 
and relies on a proximity operator subproblem which can be computed explicitly. 
Experiments on different regression problems demonstrate the efficiency of the 
optimization algorithm and its scalability with the size of the problem. They also 
demonstrate state of the art statistical performance, which improves over Lasso 
and StructOMP. 

1 Introduction 

We study the problem of learning a sparse linear regression model. The goal is to estimate a parame- 
ter vector p* € R" from a vector of measurements y 6 K m , obtained from the model y = X/3* + £, 
where X is an m x n matrix, which may be fixed or randomly chosen, and £ € K m is a vector 
resulting from the presence of noise. We are interested in sparse estimation under additional con- 
ditions on the sparsity pattern of (3* . In other words, not only we do expect that f3* is sparse but 
also that it exhibits structured sparsity, namely certain configurations of its nonzero components are 
preferred to others. This problem arises in several applications, such as regression, image denoising, 
background subtraction etc. - see lfTTl l9ll for a discussion. 

In this paper, we build upon the structured sparsity framework recently proposed by fl2l . It is a 
regularization method, formulated as a convex, nonsmooth optimization problem over a vector of 
auxiliary parameters. This approach provides a constructive way to favour certain sparsity patterns 
of the regression vector j3. Specifically, this formulation involves a penalty function given by the 

formula Q(/3\A) = inf {j£<=i (f + A*) : A G a|. This function can be interpreted as an ex- 
tension of a well-known variational form for the l\ norm. The convex constraint set A provides a 
means to incorporate prior knowledge on the magnitude of the components of the regression vector. 
As we explain in Section |2j the sparsity pattern of j3 is contained in that of the auxiliary vector A at 
the optimum. Hence, if the set A allows only for certain sparsity patterns of A, the same property 
will be "transferred" to the regression vector (3. 

In this paper, we propose a tractable class of regularizes of the above form which extends the ex- 
amples described in [ 12 1. Specifically, we study in detail the cases in which the set A is defined by 
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norm or conic constraints, combined with a linear map. As we shall see, these cases include formu- 
lations which can be used for learning graph sparsity and hierarchical sparsity, in the terminology 
of [9|. That is, the sparsity pattern of the vector j3* may consist of a few contiguous regions in one or 
more dimensions, or may be embedded in a tree structure. These sparsity problems arise in several 
applications, ranging from functional magnetic resonance imaging [7, 22], to scene recognition in 
vision 1 8 1, to multi-task learning 0~|[17] and to bioinformatics |fl9l - to mention but a few. 

A main limitation of the technique described in [12| is that in many cases of interest the penalty 
function cannot be easily computed. This makes it difficult to solve the associated regularization 
problem. For example lfl2l proposes to use block coordinate descent, and this method is feasible 
only for a limited choice of the set A. The main contribution of this paper is an efficient proximal 
point method to solve regularized least squares with the penalty function £!(/3|A) in the general 
case of set A described above. The method combines a fast fixed point iterative scheme, which is 
inspired by recent work by 1 1 3 1 with an accelerated first order method equivalent to FISTA [4 1 . We 
present a numerical study of the efficiency of the proposed method and a statistical comparison of 
the proposed penalty functions with the greedy method of [9] and the Lasso. 

Recently, there has been significant research interest on structured sparsity and the literature on this 
subject is growing fast, see for example 0] [9] [TO] Qj] [23] and references therein for an indicative 
list of papers. In this work, we mainly focus on convex penalty methods and compare them to 
greedy methods (3] [9) . The latter provide a natural extension of techniques proposed in the signal 
processing community and, as argued in [9], allow for a significant performance improvement over 
more generic sparsity models such as the Lasso or the Group Lasso l23l . The former methods have 
until recently focused mainly on extending the Group Lasso, by considering the possibility that the 
groups overlap according to certain hierarchical structures fTTll24l . Very recently, general choices 
of convex penalty functions have been proposed ll2l lT2ll . In this paper we build upon [12], providing 
both new instances of the penalty function and improved optimization algorithms. 

The paper is organized as follows. In Section [2] we set our notation, review the method of ifTSIl 
and present the new sets for inducing structured sparsity. In Section [3] we present our technique 
for computing the proximity operator of the penalty function and the resulting accelerated proximal 
method. In Section[4] we report numerical experiments with this method. 



2 Framework and extensions 



In this section, we introduce our notation, review the learning method which we extend in this paper 
and present the new sets for inducing structured sparsity. 

We let R + and R++ be the nonnegative and positive real line, respectively. For every j3 € K™ we 
define |/3| 6 R™ to be the vector \f3\ = (|/3i|)™ =1 . For every p > 1, we define the £ p norm of (3 as 

WPWp = (ELi \Pi\ p )*- If C C R", we denote by 6 C : R" — > R the indicator function of the set C, 
that is, Sc(x) = if x G C and Sc(x) — +oo otherwise. 

We now review the structured sparsity approach of 1 12 1. Given an m x n input data matrix X and 
an output vector y 6 R m , obtained from the linear regression model y = X(3* + £ discussed earlier, 
they consider the optimization problem 

inf|i||X i 9- 2 /||| + pr( i 8 ) A):/3GR",AGA| (2.1) 

where p is a positive parameter, A is a prescribed convex subset of the positive orthant R™ , and the 
function T : R™ x R™ + — > R is given by the formula 

i— 1 x / 

Note that the infimum over A in general is not attained, however the infimum over (3 is always 
attained. Since the auxiliary vector A appears only in the second term and our goal is to estimate /?*, 
we may also directly consider the regularization problem 

min|i||A/3-y||2 + p r!(/3|A) :/3eR"|, (2.2) 
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where the penalty function takes the form f2(/3|A) = inf {T({3, A) : A £ A}. This problem is still 
convex because the function T is jointly convex 0. Also, note that the function 57 is independent of 
the signs of the components of f3. 

In order to gain some insight about the above methodology, we note that, for every A £ W^ + , the 
quadratic function A) provides an upper bound to the l\ norm, namely it holds that f2(/J|A) > 
||i and the inequality is tight if and only if |/3| £ A. This fact is an immediate consequence of the 
arithmetic-geometric inequality. In particular, we see that if we choose A = R™ ,, the method d2.2l) 
reduces to the LasscQ The above observation suggests a heuristic interpretation of the method ( |2.2t : 
among all vectors f3 which have a fixed value of the l\ norm, the penalty function will encourage 
those for which \/3\ £ A. Moreover, when |/3| £ A the function ft reduces to the l\ norm and, so, 
the solution of problem ( |2.2t is expected to be sparse. The penalty function therefore will encourage 
certain desired sparsity patterns. 

The last point can be better understood by looking at problem ( 12.11 ). For every solution (f3, A), the 
sparsity pattern of (3 is contained in the sparsity pattern of A, that is, the indices associated with 
nonzero components of f3 are a subset of those of A. Indeed, if Aj = it must hold that $i = 
as well, since the objective would diverge otherwise (because of the ratio /3|/Aj). Therefore, if 
the set A favours certain sparse solutions of A, the same sparsity pattern will be reflected on (3. 
Moreover, the regularization term ^\ A^ favours sparse vectors A. For example, a constraint of the 
form Ai > • • • > A„, favours consecutive zeros at the end of A and nonzeros everywhere else. This 
will lead to zeros at the end of (3 as well. Thus, in many cases like this, it is easy to incorporate a 
convex constraint on A, whereas it may not be possible to do the same with /3. 

In this paper we consider sets A of the form 



where S is a convex set and A is a k x n matrix. Two main choices of interest are when S is a convex 
cone or the unit ball of a norm. We shall refer to the corresponding set A as conic constraint or norm 
constraint set, respectively. We next discuss two specific examples, which highlight the flexibility 
of our approach and help us understand the sparsity patterns favoured by each choice. 

Within the conic constraint sets, we may choose S = so that A = {A £ : A\ > 0}, 

which can be used to encourage hierarchical sparsity. In |[T2l they considered the set A = {A £ 
M™ + : Ai > • • • > A„} and derived an explicit formula of the corresponding regularizer fi(/3|A). 
Note that for a generic matrix A the penalty function cannot be computed explicitly. In Section 3, 
we show how to overcome this difficulty. 

Within the norm constraint sets, we may choose S to be the ^i-unit ball and A the edge map of 

a graph G with edge set E, so that A = |a £ : j)eE ~ -M — 1 j ■ This set can be 

used to encourage sparsity patterns consisting of few connected regions/subgraphs of the graph G. 

For example if G is a ID-grid we have that A = {A e : ^27=1 ~ — 1}' so me 

corresponding penalty will favour vectors whose absolute values are constant. 

For a generic convex set A, since the penalty function is not easily computable, one needs to deal 
directly with problem i2.li . To this end, we recall here the definition of the proximity operator |[T4l . 

Definition 2.1. Let ip be a real-valued convex function on M. d . The proximity operator of 'ip is defined, 



The proximity operator is well-defined, because the above minimum exists and is unique. 

3 Optimization Method 

In this section, we discuss how to solve problem (2.1) using an accelerated first-order method that 
scales linearly with respect to the problem size, as we later show in the experiments. This method 
relies on the computation of the proximity operator of the function V, restricted to R™ x A. Since 

'More generally, method j2.2\ includes the Group Lasso method, see 1121 . 



A = {A e R% : AX e S} 



for every t £ M. d 



by prox (t) :— argmin 
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the exact computation of the proximity operator is possible only in simple special cases of sets A, 
we present here an efficient fixed-point algorithm for computing the proximity operator that can be 
applied to a wide variety of constraints. Finally, we discuss an accelerated proximal method that 
uses our algorithm. 

3.1 Computation of the Proximity Operator 

According to Definition 12.11 the proximal operator of r at (a, p) £ R" x R n is the solution of the 
problem 

min(i||G8,A)-(a,/i)||l+prC8,A) :/3eR'\Ae a) . (3.1) 



v 2 

For fixed A, a direct computation yields that the objective function in ( 13.1b attains its minimum at 

\i + P 

Using this equation we obtain the simplified problem 
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This problem can still be interpreted as a proximity map computation. We discuss how to solve it 
under our general assumption A = {A £ R" + : AX £ S}. Moreover, we assume that the projection 
on the set S can be easily computed. To this end, we define the (n + k) x n matrix B T — [I, A T ] 
and the function ip(s, t) — <f\(s) + tp%{t), (s, t) £ R" x R fc , where 
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i=l 

and ifi2{t) — Ss(t). Note that the solution of problem ( 13. 2t is the same as the proximity map of the 
linearly composite function \p o B at /x, which solves the problem 

min|i||A-/i|| 2 + ^(BA) : A e K" 

At first sight this problem seems difficult to solve. However, it turns out that if the proximity map 
of the function ip has a simple form, the following theorem adapted from [13, Theorem 3.1] can be 
used to accomplish this task. For ease of notation we set d = n + k. 

Theorem 3.1. Let be a convex function on R d , B a d x n matrix, /i £ R", c > 0, and define the 
mapping H : R d -> R d at v £ R d as 

H(v) = (I — pro X £)((7 - cBB T )v + Bp). 

Then, for any fixed point v of H, it holds that 

prox^osO) =M~ cB T v. 

The Picard iterates {v s : s £ N} C R d , starting at vq £ M. d , are defined by the recursive equation 
v„ = Hivs-i). Since the operator / — prox^ 



= H(v s -i). Since the operator / — prox^ is nonexpansiv^ (see e.g. Q), the map H is nonex- 
pansive if c £ 



n . 2 



Despite this, the Picard iterates are not guaranteed to converge to a fixed 

point of H. However, a simple modification with an averaging scheme can be used to compute the 
fixed point. 

Theorem 3.2. / 751/ Let H : R d — > M. d be a nonexpansive mapping which has at least one fixed 
point and let H K :— kI + (1 — k)H. Then, for every k £ (0,1), the Picard iterates of H K converge 
to a fixed point of H. 

The required proximity operator of ip is directly given, for every (s,t) £ R™ x R fc ,by 

prox ¥ ,(s,t) = (prox vl 0),prox V2 (t)) . 



A mapping T : R d — > R d is called nonexpansive if \\T(v) — T(v')\\2 < \\v — v'\\2, for every v, v' £ 
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Both prox Vi and prox V2 can be easily computed. The latter requires computing the projection on 
the set S. The former requires, for each component of the vector s € M. n , the solution of a cubic 
equation as stated in the next lemma. 

Lemma 3.1. For every p,, a € K and r, p > 0, the function h : R_|_ — » K defined at s as h(s) := 
(s — y«) 2 +r + /ifl^ a unique minimum on its domain, which is attained at (xq — p) + , where 
Xq is the largest real root of the polynomial 2x 3 + (r — 2(/i + p))x 2 — rot 2 . 

Proof. Setting the derivative of h equal to zero and making the change of variable x = s + p yields 
the polynomial stated in the lemma. Let x$ be the largest root of this polynomial. Since the function 
h is strictly convex on its domain and grows at infinity, its minimum can be attained only at one 
point, which is xq — p, if xq > p, and zero otherwise. ■ 

3.2 Accelerated Proximal Method 

Theorem l3 . 1 I motivates a proximal numerical approach to solving problem ( 12. U and, in turn, problem 
(O. Let E(/3) = i|jX/3-y||fand assume an upper bound L of || A T A| is knowrQ. Proximal first- 
order methods - see [6, 4l[l6][2T] and references therein - can be used for nonsmooth optimization, 
where the objective consists of a strongly smooth term, plus a nonsmooth part, in our case E and 
r + S\, respectively. The idea is to replace E with its linear approximation around a point w t 
specific to iteration t. This leads to the computation of a proximity operator, and specifically in our 
case tou t := (PtAt) <- argmin{f ||(/3, A) - (w t - ^VE(w t ))\\l + pT{fi, A) : /3 e R", A e A}. 
Subsequently, the point w t is updated, based on the current and previous estimates of the solution 
ut, Ut-i, ■ ■ ■ and the process repeats. 

The simplest update rule is u> t = u t . By contrast, accelerated proximal methods proposed by lfl6l 
use a carefully chosen w update with two levels of memory, u t , u t -\. If the proximity map can be 
exactly computed, such schemes exhibit a fast quadratic decay in terms of the iteration count, that 
is, the distance of the objective from the minimal value is O (^) after T iterations. However, it 
is not known whether accelerated methods which compute the proximity operator numerically can 
achieve this rate. The main advantages of accelerated methods are their low cost per iteration and 
their scalability to large problem sizes. Moreover, in applications where a thresholding operation is 
involved - as in Lemma [3Tl - the zeros in the solution are exact. 



Algorithm 1 NEsterov PIcard-Opial algorithm (NEPIO) 

u\,wi arbitrary feasible values 
fort=l,2,. .. do 

Compute a fixed point fiw of H t by Picard-Opial 

u t+1 «- tat ~ x V ^M - T BTi(t) 

W t+ 1 <~ Tlt+lUt+l - (7T t+ l - l)u t 

end for 



For our purposes, we use a version of accelerated methods influenced by ETI (described in 
Algorithm Q] and called NEPIO). According to Nesterov |fl6l , the optimal update is w t +\ 

u t +i + 9 t +\ (-^ — ij {ut+i — u t ) where the sequence 9 t is defined by Q\ = 1 and the recursion 

1 - Ot+i }_ 
2 2 ' 

We have adapted ||2T1 Algorithm 2] (equivalent to FISTA [4|) by computing the proximity operator 
of ^ o B using the Picard-Opial process described in Section [3T| We rephrased the algorithm using 

the sequence 7r f := 1 — 9 t + y/1 — 9t = 1 — Qt + j^- f° r numerical stability. At each iteration, the 
map Ht is defined by 

H t (v) := (i - proxi) [ (i - jBB T ^j v - jB(VE(w t ) - Lw^j ) . 



3 For variants of such algorithms which adaptively learn L, see the above references. 
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We also apply an Opial averaging so that the update at stage s of the proximity computation is 
v s+ i = KV s + (l — K)Ht(v s ). By Theorem B. II the fixed point process combined with the assignment 
of u are equivalent to u t+ \ <— prox* oS (w t — j^\7E(w t ))- 

The reason for resorting to Picard-Opial is that exact computation of the proximity operator (13.2b 
is possible only in simple special cases for the set A. By contrast, our approach can be applied to 
a wide variety of constraints. Moreover, we are not aware of another proximal method for solving 
problems ( 12. Il l or (I2.21 i and alternatives like interior point methods do not scale well with problem 
size. In the next section, we will demonstrate empirically the scalability of AlgorithmQ] as well as 
the efficiency of both the proximity map computation and the overall method. 



4 Numerical Simulations 



In this section, we present experiments with method (12. U . The goal of the experiments is to both 
study the computational and the statistical estimation properties of this method. One important 
aim of the experiments is to demonstrate that the method is statistically competitive or superior 
to state-of-the-art methods while being computationally efficient. The methods employed are the 
Lasso, StructOMP [9 | and method (12. Il l with the following choices for the constraint set A: Grid-C, 
A Q = {A : ||A\||i < a}, where A is the edge map of a ID or 2D grid and a > 0, and Tree-C, 
A = {A : AX > 0}, where A is the edge map of a tree graph. 

We solved the optimization problem (12. U either with the toolbox CVXQor with the proximal method 
presented in Section [3] When using the proximal method, we chose the parameter from Opial's 
Theorem k = 0.2 and we stopped the iterations when the relative decrease in the objective value 
is less than 1CP 8 . For the computation of the proximity operator, we stopped the iterations of the 
Picard-Opial method when the relative difference between two consecutive iterates is smaller than 
10~ 2 . We studied the effect of varying this tolerance in the next experiments. We used the square 
loss and computed the Lipschitz constant L using singular value decomposition. 



Grid-C Tree-C Difference with CVX value 




Figure 1: (Left and Centre) Computation time vs problem size for Grid-C and Tree-C. (Right) 
Difference with the solution obtained via CVX vs Picard-Opial tolerance. 



4.1 Efficiency experiments 

First, we investigated the computational properties of the proximal method. Our aim in these exper- 
iments was to show that our algorithm has a time complexity that scales linearly with the number of 
variables, while the sparsity and relative number of training examples is kept constant. We consid- 
ered both the Grid and the Tree constraints and compared our algorithm to the toolbox CVX, which 
is an interior-point method solver. As is commonly known, interior-point methods are very fast, 
but do not scale well with the problem size. In the case of the Tree constraint, we also compared 
with a modified version of the alternating algorithm of lfl2l . For each problem size, we repeated 
the experiments 10 times and we report the average computation time in Figure \Q(Left and Centre) 
for Grid-C and Tree-C, respectively. This result indicates that our method is suitable for large scale 
experiments. 

We also studied the importance of the Picard-Opial tolerance for converging to a good solution. In 
Figure[T} Right, we report the average relative distance to the solution obtained via CVX for different 

' jhttp : / / cvxr . com/cvx/| 
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Figure 2: ID contiguous regions: comparison between different methods for one (left), two (centre- 
left), three (centre-right) and four (right) regions. 



TIT 



Figure 3: Two ID contiguous regions: regression vector estimated by different models: /3* (left), 
Lasso (centre-left), StructOMP (centre-right), Grid-C (right). 



values of the Picard-Opial tolerance. We considered a problem with 100 variables and repeated the 
experiment 10 times with different sampling of training examples, considering both the Grid and 
the Tree constraint. It is clear that decreasing the tolerance did not bring any advantage in terms of 
converging to a better solution, while it remarkably increased the computational overhead, passing 
from an average of 5s for a tolerance of 10 -2 to 40s for 10~ 8 in the case of the Grid constraint. 

Finally, we considered the 2D Grid-C case and observed that the number of Picard-Opial iterations 
needed to reach a tolerance of 10~ 2 , scales well with the number of variables n. For example when 
n varies between 200 and 6400, the average number of iterations varied between 20 and 40. 



4.2 Statistical experiments 

One dimensional contiguous regions. In this experiment, we chose a model vector f}* 6 R 200 
with 20 nonzero elements, whose values are random ±1. We considered sparsity patterns forming 
one, two, three or four non-overlapping contiguous regions, which have lengths of 20, 10, 7 or 5, 
respectively. We generated a noiseless output from a matrix X whose elements have a standard 
Gaussian distribution. The estimates /3 for several models are then compared with the original. 
Figure [2] shows the model error H jrfrjj~ as tri e sample size changes from 22 (barely above the 
sparsity) up to 100 (half the dimensionality). This is the average over 50 runs, each with a different 
/3* and X. We observe that Grid-C outperforms both Lasso and StructOMP, whose performance 
deteriorates as the number of regions is increased. For one particular run with a sample size which is 
twice the model sparsity, Figure [3] shows the original vector and the estimates for different methods. 

Two dimensional contiguous regions. We repeated the experiment in the case that the sparsity 
pattern of /3* € jj20x20 cons i s t s G f 2D rectangular regions. We considered either a single 5x5 
region, two regions (4x4 and 3 x 3), three 3x3 regions and four 3x2 regions. Figure|4]shows the 
model error versus the sample size in this case. Figure [5] shows the original image and the images 
estimated by different methods for a sample size which is twice the model sparsity. Note that Grid-C 
is superior to both the Lasso and StructOMP and that StructOMP is outperformed by Lasso when 
the number of regions is more than two. This behavior is consistently confirmed by experiments in 
higher dimensions, not shown here for brevity. 

Background subtraction. We replicated the experiment from |9j Sec. 7.3] with our method. Briefly, 
the underlying model j3* corresponds to the pixels of the foreground of a CCTV image. We measured 
the output as a random projection plus Gaussian noise. Figure [6}Le/f shows that, while the Grid-C 
outperforms the Lasso, it is not as good as StructOMP. We speculate that this result is due to the non 
uniformity of the values of the image, which makes it harder for Grid-C to estimate the model. 
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Figure 4: 2D contiguous regions: comparison between different methods for one (left), two (centre- 
left), three (centre-right) and four (right) regions. 





.-A 



Figure 5: 2Z?-contiguous regions: model vector and vectors estimated by the Lasso, StructOMP and 
Grid-C (left to right), for one region (first group) and two regions (second group). 



Image Compressive Sensing. In this experiment, we compared the performance of Tree-C on an 
instance of 2D image compressive sensing, following the experimental protocol of (9). Natural 
images can be well represented with a wavelet basis and their wavelet coefficients, besides being 
sparse, are also structured as a hierarchical tree. We computed the Haar-wavelet coefficients of a 
widely used cameraman image. We measured the output as a random projection plus Gaussian 
noise. StructOMP and Tree-C, both exploiting the tree structure, were used to recover the wavelet 
coefficients from the measurements and compared to the Lasso. The inverse wavelet transform was 
used to reconstruct the images with the estimated coefficients. The recovery performances of the 
methods are reported in Figure ® Right, which highlights the good performance of Tree-C. 




Sample size Sample size 

Figure 6: Model error for the background subtraction (left) and cameraman (right) experiments. 



5 Conclusion 



We proposed new families of penalties and presented a new algorithm and results on the class of 
structured sparsity penalty functions proposed by [ \2\. These penalties can be used, among else, to 
learn regression vectors whose sparsity pattern is formed by few contiguous regions. We presented 
a proximal method for solving this class of penalty functions and derived an efficient fixed-point 
method for computing the proximity operator of our penalty. We reported encouraging experimen- 
tal results, which highlight the advantages of the proposed penalty function over a state-of-the-art 
greedy method [9 |. At the same time, our numerical simulations indicate that the proximal method 
is computationally efficient, scaling linearly with the problem size. An important problem which we 
wish to address in the future is to study the convergence rate of the method and determine whether 
the optimal rate O(^) can be attained. Finally, it would be important to derive sparse oracle in- 
equalities for the estimators studied here. 
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